Evaluating Equity at the Local Level Using Bootstrap Tests


Introduction 

Because of concerns about test security, different test forms are typically used across 
different testing occasions. As a result, equating is necessary in order to get scores from 
the different test forms that can be used interchangeably. In order to assure the quality of 
equating, multiple equating methods are often examined. Various equity properties have 
been used to assess the adequacy of different equating methods. For example, following the 
"weak" definition of equity (Morris, 1982), first-order equity refers to two alternate forms 
having similar expected scores given true examinee ability. Second-order equity refers to two 
alternate forms having a similar error of measurement given true examinee ability. Thus, first- 
and second-order equity properties examine the first and the second central moments of the 
conditional distributions of test scores at given levels of examinee ability. 

The discrepancy indices $D_1$ and $D_2$ (defined in more detail below) are often used as evaluation
criteria for satisfying equity properties (Kim, Brennan, & Kolen, 2005; Tong & Kolen, 2005;
Lee, Lee, & Brennan, 2010). These discrepancy indices evaluate equity at the global level
as a form of weighted average difference in the expected scale scores or error variance of 
measurement between alternate forms over all levels of examinee ability. Thus, these indices 
provide an overall summary of whether equity properties are satisfied between two alternate 
forms over the entire range of scale scores. 

If there is a specific range of scale scores that requires specific attention (for example, 
cut scores for licensure tests and scholarship competitions), then it would be advisable to 
evaluate equity at the local rather than the global level. At the local level, however, there 
are no formal statistical tests or evaluation criteria available to assess equity, apart from 
examining graphs. Furthermore, it is challenging to conduct formal statistical tests of the 
differences in the conditional mean and variance between two test forms because the 
statistical tests would require a strong assumption about the conditional distribution of the 
test scores at each examinee ability level. Therefore, the current study proposes a method 
that evaluates equity properties using the bootstrap technique, which allows for a statistical 
test of equity at the local level without making any distributional assumptions. 

The bootstrap tests for the equity properties are developed by following the bootstrap 
procedure discussed in Boos and Brownie (1989). The current study demonstrates the 
bootstrap tests using large-scale assessment data in which the scores are used to determine
students' eligibility for a scholarship. The equity properties for the assessment, in particular at the
cut scores at which the scholarships are determined, are examined within an IRT framework
(Kolen, Zeng, & Hanson, 1996). In terms of equating design, the current study focuses on the
common item nonequivalent groups equating design. Four equating methods — IRT true score, 
IRT observed score, frequency estimation (FE), and chained equipercentile (CE) method — are 
compared. 


Evaluating Equity Properties and 
Evaluation Criteria 

If equating is appropriate, the distribution of scale scores should be the same on the old 
form and the new form after equating. Thus, the adequacy of equating is often evaluated by 
assessing two equity properties — first-order equity and second-order equity. The equity 
properties hold if examinees at a given true score have the same expected scale score 
(first-order equity) and the same conditional error of measurement (second-order equity) on
the two forms (Tong & Kolen, 2005). 

For tests with dichotomously scored items, several procedures have been developed to 
examine equity properties (Kolen, Hanson, & Brennan, 1992; Kolen et al., 1996). Kolen et al. 
(1996) presented a procedure for estimating the expected values and conditional standard 
errors of measurement of scale scores in an IRT framework. According to Kolen et al., for a
test with $K$ items, first-order equity concerns the true expected scale score $\xi(\theta)$ at a given
examinee ability $\theta$, which can be written as

$$\xi(\theta) = E[s(X) \mid \theta] = \sum_{i=0}^{K} s(i)\,P(X = i \mid \theta),$$

in which $s(X)$ is the raw-to-scale score transformation function, and $P(X = i \mid \theta)$ is the probability
of the raw score random variable $X$ being equal to $i$ ($i = 0, 1, \ldots, K$), which can be calculated
using Lord and Wingersky's recursive formula (1984). Regarding second-order equity, the
conditional standard error of measurement (CSEM) of a scale score given $\theta$ can be written as

$$\sigma[s(X) \mid \theta] = \sqrt{E\{[s(X) - \xi(\theta)]^2 \mid \theta\}} = \sqrt{\sum_{i=0}^{K} [s(i) - \xi(\theta)]^2\,P(X = i \mid \theta)}.$$
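The two conditional moments above can be illustrated with a minimal Python sketch. It assumes dichotomous items under a three-parameter logistic model with scaling constant 1.7, and a raw-to-scale conversion supplied as a simple lookup list. The function names, the item parameters, and the scaling constant are illustrative assumptions, not values taken from this study.

```python
import math

def p_3pl(theta, a, b, c, scaling=1.7):
    """Probability of a correct response under a 3PL model (scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scaling * a * (theta - b)))

def lord_wingersky(probs):
    """Raw score distribution P(X = i | theta), i = 0..K, built one item at a time."""
    dist = [1.0]  # with zero items, a raw score of 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def expected_scale_and_csem(theta, items, scale):
    """Expected scale score and CSEM at theta; scale[i] is the scale score for raw score i."""
    dist = lord_wingersky([p_3pl(theta, a, b, c) for (a, b, c) in items])
    xi = sum(scale[i] * dist[i] for i in range(len(dist)))
    variance = sum((scale[i] - xi) ** 2 * dist[i] for i in range(len(dist)))
    return xi, math.sqrt(variance)

# Hypothetical three-item test with an identity raw-to-scale conversion.
items = [(1.0, 0.0, 0.2), (1.2, 0.5, 0.25), (0.8, -0.5, 0.2)]
scale = [0, 1, 2, 3]
xi, csem = expected_scale_and_csem(0.0, items, scale)
```

With an identity conversion, the expected scale score reduces to the sum of the item probabilities, which gives a quick sanity check on the recursion.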


After estimating expected scale scores and CSEMs, previous research used the overall
discrepancy indices $D_1$ and $D_2$ as evaluation criteria to demonstrate whether the
equity properties had been preserved (Tong & Kolen, 2005; Lee et al., 2010). $D_1$ refers to the
expected scale scores, and $D_2$ refers to the CSEMs. Following Lee et al. (2010), for two
forms, Form W and Form S, $D_1$ and $D_2$ can be written as follows:

$$D_1 = \sum_i w_i \left| E(SC_W \mid \theta_i) - E(SC_S \mid \theta_i) \right|$$

in which $E(SC_W \mid \theta_i)$ and $E(SC_S \mid \theta_i)$ are the expected scale scores from the two forms, Form W
and Form S, at quadrature point $\theta_i$; $EV_W \mid \theta_i$ and $EV_S \mid \theta_i$ are the conditional error variances of
measurement for the two forms at quadrature point $\theta_i$; and $w_i$ is the weight of $\theta_i$. Smaller
values of $D_1$ and $D_2$ imply small discrepancies between the two forms in terms of
first-order equity and second-order equity, and thus imply that the equity properties are preserved.
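As a rough illustration of a global discrepancy index of this kind, the sketch below computes a weighted average of absolute differences between two forms across quadrature points. It assumes the weights are normalized to sum to one and ignores any further standardization the published indices may apply; the function name and the numbers are hypothetical.

```python
def weighted_abs_discrepancy(values_w, values_s, weights):
    """Weighted average absolute difference between two forms over quadrature points.

    values_w, values_s: conditional quantities (for example, expected scale scores)
    for Form W and Form S at each quadrature point.
    weights: quadrature weights; they are normalized here (an assumption).
    """
    total_weight = sum(weights)
    weighted_sum = sum(w * abs(a - b) for a, b, w in zip(values_w, values_s, weights))
    return weighted_sum / total_weight

# Hypothetical expected scale scores at two quadrature points.
index = weighted_abs_discrepancy([50, 60], [51, 58], [1, 3])
```

Because the differences are averaged over the whole ability distribution, a small value can still hide a sizeable difference at a single quadrature point that carries little weight.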

Although $D_1$ and $D_2$ provide information on whether the equity properties are preserved,
the discrepancy indices only summarize the information at the global level, that is, over the 
entire range of the examinee ability distribution. If a particular level of examinee ability is of 
interest, these overall summary indices do not provide this information. Instead of a statistical 
test, graphical inspection at the examinee ability level of interest may be used. In practice, 
ensuring the preservation of equity properties at particular examinee ability levels is often of 
great importance for test fairness when the ability levels are used for cut scores of, among 
other things, a licensure test, scholarship competition, or college admission. In those cases, 
a formal statistical test for assessing these equity properties is beneficial. 


College Board Research Reports 




Bootstrap Tests for Equity Properties 

Boos and Brownie (1989) suggested using the bootstrap, introduced by Efron (1979),
to test for homogeneity of variance without making any assumptions about the sample
distribution. For example, the commonly used F test for equality of variances requires a
normal distribution assumption. If the distribution is unknown, however, the test may not be
appropriate. Thus, Boos and Brownie proposed a bootstrap technique that allows one to use
normal theory test statistics, such as the F test for equal variance between two samples,
without making a normality assumption.

The bootstrap technique by Boos and Brownie is applicable to testing for equity properties, 
in particular, testing for second-order equity, which is related to the CSEMs from two 
test forms. To test whether the second-order equity holds, the variance of the conditional 
scale score distribution at a given level of examinee ability in one test form needs to 
be compared with the variance in the other test form. Since the parametric form of the 
conditional scale score distribution given examinee ability is not necessarily known, the 
bootstrap procedure can be applied to test the equality of CSEMs from the two test 
forms. 

Based on the bootstrap procedure discussed in Boos and Brownie, the proposed bootstrap 
tests for first-order equity and second-order equity are as follows: 

• Draw k independent bootstrap samples from the original sample with replacement.

• Compute the test statistics $T_1$ and $T_2$ for each bootstrap sample, and obtain
the distributions of $T_1$ for first-order equity and $T_2$ for second-order equity:

o $T_{1j} = E(SC_W \mid \theta) - E(SC_S \mid \theta)$,

o $T_{2j} = (EV_W \mid \theta) / (EV_S \mid \theta)$, $j = 1, 2, \ldots, k$,

in which $E(SC_f \mid \theta)$ is the expected scale score given examinee ability $\theta$ for form $f$ ($f$ = W, S) and $EV_f \mid \theta$ is
the error variance of measurement at $\theta$ for form $f$.

• Construct a 95% interval for $T_1$ and $T_2$ using the k samples.

• Examine whether the interval includes zero for first-order equity and one for
second-order equity.
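The resampling logic of these steps can be sketched in Python as follows. This is a simplified stand-in: it resamples two lists of scores directly and uses their sample mean and variance, whereas the procedure described here derives the conditional mean and error variance from an IRT model at a fixed ability level. The function names, the nearest-rank percentile rule, and the simulated data are assumptions made for illustration.

```python
import random
import statistics

def percentile_interval(values, level=0.95):
    """Nearest-rank percentile interval of a list of bootstrap statistics."""
    ordered = sorted(values)
    alpha = (1.0 - level) / 2.0
    lower = ordered[int(alpha * (len(ordered) - 1))]
    upper = ordered[int(round((1.0 - alpha) * (len(ordered) - 1)))]
    return lower, upper

def bootstrap_equity_intervals(scores_w, scores_s, k=1000, seed=0):
    """Bootstrap intervals for a mean difference (T1) and a variance ratio (T2)."""
    rng = random.Random(seed)
    t1, t2 = [], []
    for _ in range(k):
        # Resample each form independently, with replacement.
        boot_w = rng.choices(scores_w, k=len(scores_w))
        boot_s = rng.choices(scores_s, k=len(scores_s))
        t1.append(statistics.fmean(boot_w) - statistics.fmean(boot_s))
        t2.append(statistics.variance(boot_w) / statistics.variance(boot_s))
    return percentile_interval(t1), percentile_interval(t2)

# Simulated scores for two forms that clearly differ in mean and spread.
data_rng = random.Random(1)
form_w = [data_rng.gauss(50, 3) for _ in range(200)]
form_s = [data_rng.gauss(60, 9) for _ in range(200)]
t1_interval, t2_interval = bootstrap_equity_intervals(form_w, form_s, k=500, seed=2)
```

The decision rule then reads directly off the two intervals: first-order equity is questioned when the first interval excludes zero, and second-order equity when the second excludes one.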


Method 

Data 

The current study demonstrates the bootstrap tests for the equity properties using a random
sample from a pre-2015 PSAT/NMSQT® test administration. The PSAT/NMSQT assessment
is a norm-referenced test designed primarily for 10th- and 11th-grade students. The pre-2015
PSAT/NMSQT includes three test areas: critical reading, mathematics, and writing. The
primary intended uses are as a low-stakes assessment in preparation for taking the SAT® and
as a high-stakes assessment for 11th-grade students to determine eligibility to participate in
the National Merit Scholarship competition. The PSAT/NMSQT is cosponsored by
the College Board and the National Merit Scholarship Corporation. The PSAT/NMSQT is
administered in the fall, on a Wednesday and the following Saturday of the same week, with
separate forms.

The pre-2015 PSAT/NMSQT inherited many of its content and psychometric properties from
the SAT. The reported PSAT/NMSQT score scale ranged from 20 to 80 in increments of 1,
for a total of 61 scale score points. The PSAT/NMSQT score scale was primarily maintained
through the maintenance of the SAT score scale. The parent pre-March-2016 SAT forms were
equated to previously administered SAT forms. Once the pre-2015 PSAT/NMSQT forms were
administered, they were equated back to their parent pre-March-2016 SAT forms using
the common item nonequivalent groups equating design. At least 60% to 65% of the total
items in the pre-2015 PSAT/NMSQT were used as common items. See Cook, Dunbar, & Eignor
(1981) for details on the equating design of the pre-2015 PSAT/NMSQT forms.

Analyses 

For each pre-2015 PSAT/NMSQT test form (Form W and S), a random sample with 
N = 10,000 was obtained. For its SAT parent forms, a random sample with N = 5,000 was 
used. To conduct the bootstrap tests, the following steps were used: 

• Step 1: For each form, a bootstrap sample was drawn from the original sample with
replacement.

• Step 2: The items in the bootstrap sample were calibrated using flexMIRT version 2.0
(Cai, 2013). The three-parameter logistic (3PL) IRT model was used for item calibration.

• Step 3: Once item calibration was completed, items from the two PSAT/NMSQT forms
were placed on the SAT scale using the Haebara method (Haebara, 1980).

• Step 4: Equating was conducted using four methods: IRT true score, IRT observed
score, frequency estimation (FE), and chained equipercentile (CE).

• Step 5: Based on the equating results, the expected rounded scale scores and
conditional standard errors of measurement were computed to examine
first-order equity and second-order equity. The test statistics, $T_1$ and $T_2$,
were computed.

• Step 6: Steps 1 to 5 were repeated 1,000 times.


Results 

The descriptive statistics of the raw scores from the random samples of the pre-2015
PSAT/NMSQT (N = 10,000) and pre-March-2016 SAT (N = 5,000) forms are given in Table 1.
In addition, the effect size¹ between the PSAT/NMSQT and SAT forms is presented. PSAT/NMSQT
takers are primarily 10th- and 11th-graders, whereas SAT takers are primarily 11th- and
12th-graders. Thus, the examinee ability
of the SAT takers tends to be higher than that of the PSAT/NMSQT test-takers. Due to this
difference in test-taker populations between the PSAT/NMSQT and SAT, the observed effect
sizes between the PSAT/NMSQT and SAT were quite large.


1. The effect size was computed using the following:

$$\text{Effect Size} = \frac{\bar{X}_{PSAT} - \bar{X}_{SAT}}{\sqrt{(SD_{PSAT}^2 + SD_{SAT}^2)/2}}$$

Table 1.
Summary Statistics for Raw Scores

Form W and SAT 1

Test Section       PSAT/NMSQT                          SAT                                 Effect Size
                   N       Mean   SD     Min  Max      N      Mean   SD     Min  Max
Critical Reading   10,000  22.76  8.63   1    48       5,000  34.82  12.45  6    67        -1.13
Writing            10,000  18.64  7.88   1    39       5,000  27.30  9.01   5    49        -1.02
Math               10,000  17.12  8.11   1    38       5,000  28.39  11.26  2    54        -1.15

Form S and SAT 2

Test Section       PSAT/NMSQT                          SAT                                 Effect Size
                   N       Mean   SD     Min  Max      N      Mean   SD     Min  Max
Critical Reading   10,000  28.36  8.16   5    48       5,000  35.54  12.58  5    67        -0.68
Writing            10,000  23.44  7.11   3    39       5,000  27.30  9.23   3    49        -0.47
Math               10,000  21.73  6.96   2    38       5,000  29.24  11.11  3    54        -0.81

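The effect sizes in Table 1 can be reproduced from the reported means and standard deviations. The short Python check below assumes a standardized mean difference whose denominator is the square root of the simple average of the two variances; this form is inferred from the tabled values rather than quoted from the report, and the function name is illustrative.

```python
import math

def effect_size(mean_psat, sd_psat, mean_sat, sd_sat):
    """Standardized mean difference using the unweighted average of the two variances."""
    pooled_sd = math.sqrt((sd_psat ** 2 + sd_sat ** 2) / 2.0)
    return (mean_psat - mean_sat) / pooled_sd

# Form W Critical Reading versus SAT 1, values taken from Table 1.
es_cr_form_w = round(effect_size(22.76, 8.63, 34.82, 12.45), 2)
# Form S Writing versus SAT 2, values taken from Table 1.
es_wr_form_s = round(effect_size(23.44, 7.11, 27.30, 9.23), 2)
```

The negative sign reflects the direction of the comparison: the PSAT/NMSQT groups scored lower on average than the SAT groups.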


The two PSAT/NMSQT forms were equated to their SAT parent forms, which had
already been equated, using a common item nonequivalent groups design; that is, PSAT/NMSQT
Form W was equated to SAT Form 1, and PSAT/NMSQT Form S was equated to SAT
Form 2. Table 2 shows the discrepancy indices $D_1$ and $D_2$ for the results of the rounded
scale score equating using the four equating methods. The IRT equating methods (true score and observed
score) seem better than the FE and CE methods in terms of $D_1$ because the IRT methods
had smaller $D_1$ values than either the FE or CE method for all three sections. On
the other hand, the FE and CE methods generally had smaller $D_2$ values than the IRT methods. Thus,
the two IRT methods seemed to preserve first-order equity better than the FE and CE
methods, while the FE and CE methods seemed to preserve second-order equity better
than the IRT methods. Second-order equity seemed to be slightly better preserved for
Critical Reading and Writing than for Math.


Table 2.
Discrepancy Indices for Rounded Scale Score Equating

Method     D_1                  D_2
           CR     M      W      CR     M      W
IRT True   0.31   0.23   0.23   1.01   1.43   0.87
IRT Obs    0.32   0.11   0.18   0.96   1.27   0.86
FE         0.41   0.60   0.27   0.80   1.26   0.90
CE         0.37   0.54   0.24   0.83   1.25   0.77


The plots of expected scale scores for both PSAT/NMSQT forms as well as the difference
in the expected scale scores between the two forms are shown in Figures 1 to 4 for Critical
Reading, Figures 13 to 16 for Math, and Figures 25 to 28 for Writing. Overall, the differences
in expected scores between the two forms were between -1 and +1 across all examinee
ability levels for all equating methods except for the FE and CE methods in Writing. This small
overall discrepancy of the expected scale scores between the two forms indicates that
first-order equity was well preserved. Although the overall difference was small, the
ability range in which the largest differences occurred varied across equating methods.

Figure 1. First-order equity using IRT true score equating for Critical Reading.

Figure 2. First-order equity using IRT observed score equating for Critical Reading.

Figure 3. First-order equity using frequency estimation equating for Critical Reading.

Figure 4. First-order equity using chained equipercentile equating for Critical Reading.

The plots of CSEMs for the two forms and the ratio of the CSEMs between the two forms are
presented in Figures 5-8 for Critical Reading, Figures 17-20 for Math, and Figures 29-32 for
Writing. As with the expected scale score difference, the difference in CSEMs between the
two forms was small: the ratio ranged between 0.5 and 1.5 for all equating methods. The
ratio tended to be larger at higher levels of examinee ability.


Figure 5. Second-order equity using IRT true score equating for Critical Reading.

Figure 6. Second-order equity using IRT observed score equating for Critical Reading.

Figure 7. Second-order equity using frequency estimation equating for Critical Reading.

Figure 8. Second-order equity using chained equipercentile equating for Critical Reading.

After examining the plots of expected scale scores and the ratio of the CSEMs, a natural
question to ask is "Are these differences significant?" The bootstrap test can answer this
question. Tables 3 and 4 summarize the results of the bootstrap tests for equity properties.
Figures 9-12 for Critical Reading, Figures 21-24 for Math, and Figures 33-36 for Writing show
the 95% confidence intervals for the first-order equity and second-order equity bootstrap
tests. Using the 95% confidence interval band plots, the examinee ability levels at which the
interval does not contain zero for first-order equity, or one for second-order equity, can be easily
detected for each equating method.


Figure 9. Bootstrap test for equity for IRT true score equating for Critical Reading.

Figure 10. Bootstrap test for equity for IRT observed score equating for Critical Reading.

Figure 11. Bootstrap test for equity for frequency estimation equating for Critical Reading.

Figure 12. Bootstrap test for equity for chained equipercentile equating for Critical Reading.

Figure 13. First-order equity using IRT true score equating for Math.

Figure 14. First-order equity using IRT observed score equating for Math.

Figure 15. First-order equity using frequency estimation equating for Math.

Figure 16. First-order equity using chained equipercentile equating for Math.

Figure 17. Second-order equity using IRT true score equating for Math.

Figure 18. Second-order equity using IRT observed score equating for Math.

Figure 19. Second-order equity using frequency estimation equating for Math.




Figure 20. Second-order equity using chained equipercentile equating for Math.

Figure 21. Bootstrap test for equity for IRT true score equating for Math.

Figure 22. Bootstrap test for equity for IRT observed score equating for Math.

Figure 23. Bootstrap test for equity for frequency estimation equating for Math.

Figure 24. Bootstrap test for equity for chained equipercentile equating for Math.

Figure 25. First-order equity using IRT true score equating for Writing.

Figure 26. First-order equity using IRT observed score equating for Writing.

Figure 27. First-order equity using frequency estimation equating for Writing.

Figure 28. First-order equity using chained equipercentile equating for Writing.

Figure 29. Second-order equity using IRT true score equating for Writing.

Figure 30. Second-order equity using IRT observed score equating for Writing.

Figure 31. Second-order equity using frequency estimation equating for Writing.

Figure 32. Second-order equity using chained equipercentile equating for Writing.

Figure 33. Bootstrap test for equity for IRT true score equating for Writing.

Figure 34. Bootstrap test for equity for IRT observed score equating for Writing.

Figure 35. Bootstrap test for equity for frequency estimation equating for Writing.

In Table 3, at a given examinee ability level, each equating method has a value of zero if the
95% bootstrap confidence interval for testing first-order equity includes zero and a
value of one if the confidence interval does not include zero. In Table 4, each equating method
has a value of zero if the 95% bootstrap confidence interval for testing second-order
equity includes one and a value of one if the confidence interval does not include one.
The last two rows provide a summary of the flags used to indicate whether the first-order
equity or the second-order equity property was preserved at each examinee ability level. The first
of these rows contains the total number of flags, and the second contains the number of flags
where the examinee ability level is greater than or equal to two ("Flag at >= 2.0"). Ability levels
of two or higher were examined because an examinee ability level of two is equivalent
to approximately 70 on the 20 to 80 pre-2015 PSAT/NMSQT scale, and this score range
is usually important for the scholarship competition. The number of flags at ability levels greater than or equal to two
varied depending on the equating method. These flags may help to decide which
equating method to select, since this score range is important for the PSAT/NMSQT. It
is sensible to select an equating method that produces fewer flags at the score
level that the assessment is most concerned about.
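The flagging rule used in Tables 3 and 4 can be expressed compactly. The Python sketch below assumes one bootstrap interval per ability level and uses made-up intervals for illustration; the function names and the example values are hypothetical.

```python
def flag_intervals(intervals, null_value):
    """Return 1 for each interval that excludes null_value, and 0 otherwise."""
    return [0 if lower <= null_value <= upper else 1 for (lower, upper) in intervals]

def summarize_flags(thetas, flags, cutoff=2.0):
    """Total number of flags and the number of flags at ability levels >= cutoff."""
    total = sum(flags)
    at_or_above_cutoff = sum(f for t, f in zip(thetas, flags) if t >= cutoff)
    return total, at_or_above_cutoff

# Hypothetical first-order equity intervals (null value 0) at three ability levels.
thetas = [1.8, 2.0, 2.2]
intervals = [(-0.5, 0.4), (0.1, 0.9), (-1.2, -0.3)]
flags = flag_intervals(intervals, null_value=0.0)
total_flags, flags_at_cutoff = summarize_flags(thetas, flags)
```

For second-order equity the same functions apply with `null_value=1.0`, since the statistic is a ratio of error variances.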

In practice, the discrepancy indices are often examined to assess multiple equating methods.
A method providing smaller discrepancy indices is considered to be a better method in
terms of preserving equity properties. However, smaller discrepancy indices may not provide
sufficient information for a specific score level of interest. For example, for Critical Reading,
the IRT true score equating method has the smallest $D_1$ index (Table 2). In Figure 1, the
difference in the expected scale scores between the two forms seemed small. Thus, one
might select IRT true score equating as the final equating method. In fact, this result
is consistent with previous research findings that the IRT true score equating method
preserves first-order equity better than other equating methods (Kim et al., 2005; Tong &
Kolen, 2005; Lee et al., 2010). However, the bootstrap test for first-order equity revealed that
the confidence interval band for examinee ability between 2.0 and 3.0, which is
important for the scholarship competition, did not include zero, indicating that the discrepancy
between the two forms at that ability level is significant for IRT true score equating. With
respect to first-order equity at this local level, an equating method other than IRT true score
equating would therefore be preferable.

Limitations of the discrepancy index were also found for second-order equity. The $D_2$ index in
Table 2 shows that for Math, overall, the FE and CE methods satisfied second-order equity 
better than the IRT methods. However, the confidence interval band for the examinee ability 
score at around 3.0 or higher did not include one for the FE and CE methods (Figures 23 
and 24), while the confidence interval band for the same examinee ability level included one 
for the IRT true and observed score equating methods. Thus, the IRT methods preserved 
second-order equity for the higher examinee ability scores better than the FE or CE methods, 
even though overall the FE or CE methods preserved second-order equity better than the IRT 
methods. 




Table 3.
Bootstrap Test for First Order Equity on Rounded Score

                CR                           Math                         Writing
theta           IRT true  IRT obs  FE  CE    IRT true  IRT obs  FE  CE    IRT true  IRT obs  FE  CE
-4.0            0         0        0   0     0         0        0   0     0         0        0   0
-3.8            0         0        0   0     0         0        0   0     0         0        0   0
-3.6            0         0        0   0     1         0        0   0     0         0        0   0
-3.4            0         0        0   0     1         0        0   0     0         0        0   0
-3.2            0         0        0   0     1         0        0   0     0         0        0   0
-3.0            0         0        0   0     1         0        0   0     0         0        0   0
-2.8            0         0        0   0     1         0        1   0     0         0        1   0
-2.6            0         0        0   0     1         0        1   0     0         0        1   0
-2.4            0         0        1   0     1         0        1   0     0         0        1   0
-2.2            0         0        1   0     1         0        1   0     0         0        1   0
-2.0            0         0        1   0     0         0        1   1     0         0        1   0
-1.8            0         0        1   0     0         0        1   1     0         0        1   0
-1.6            0         0        1   1     0         0        1   1     0         0        1   0
-1.4            0         0        1   1     0         0        1   1     0         0        1   0
-1.2            0         0        1   0     0         0        1   0     0         0        0   0
-1.0            0         0        1   0     0         0        1   0     0         0        0   0
-0.8            0         0        0   0     0         0        1   0     0         0        0   0
-0.6            0         0        0   0     0         0        1   0     0         0        0   0
-0.4            0         0        0   0     0         0        0   0     0         0        0   0
-0.2            0         0        0   0     0         0        0   0     0         0        0   0
0.0             0         0        0   0     0         0        0   0     0         0        0   0
0.2             0         0        0   0     0         0        1   0     0         0        0   0
0.4             0         0        0   0     0         0        1   0     0         0        0   0
0.6             0         0        0   0     0         0        0   0     0         0        0   0
0.8             0         0        0   0     0         0        0   0     0         0        0   0
1.0             0         0        0   1     0         0        0   0     0         0        0   0
1.2             0         0        1   1     0         0        0   0     0         0        0   0
1.4             0         0        1   1     0         0        0   0     0         0        0   0
1.6             0         0        1   1     0         0        0   0     0         0        0   0
1.8             1         1        1   1     0         0        0   0     0         0        0   0
2.0             1         1        1   1     0         0        0   0     0         0        0   0
2.2             1         1        0   0     0         0        0   0     0         0        0   0
2.4             1         0        0   0     0         0        0   0     0         0        0   0
2.6             1         0        0   0     0         0        0   0     0         0        0   0
2.8             1         0        0   0     0         0        0   0     0         0        0   0
3.0             0         0        0   0     0         0        0   0     0         0        0   0
3.2             0         0        0   0     0         0        0   0     0         0        0   0
3.4             0         0        0   0     0         0        0   0     0         0        0   0
3.6             0         0        0   0     0         0        0   0     0         0        0   0
3.8             0         0        0   0     0         0        0   0     0         0        0   0
4.0             0         0        0   0     1         0        0   0     0         0        0   0
Total Flag      6         3        13  8     9         0        14  4     0         0        8   0
Flag at >= 2.0  5         2        1   1     1         0        0   0     0         0        0   0

Table 4.
Bootstrap Test for Second Order Equity on Rounded Score

                CR                           Math                         Writing
theta           IRT true  IRT obs  FE  CE    IRT true  IRT obs  FE  CE    IRT true  IRT obs  FE  CE
-4.0            0         0        0   0     0         0        0   0     0         1        0   0
-3.8            0         0        0   0     0         0        0   0     0         1        0   0
-3.6            0         0        0   0     0         0        0   0     0         1        0   0
-3.4            0         0        0   0     0         0        0   0     0         1        0   0
-3.2            0         0        0   0     0         0        0   0     0         1        0   0
-3.0            0         0        0   0     1         0        0   0     0         1        0   0
-2.8            0         0        0   0     1         0        0   0     0         1        0   0
-2.6            0         0        0   0     1         1        0   0     0         1        0   0
-2.4            0         0        0   0     1         1        0   0     0         0        0   0
-2.2            0         0        0   0     1         1        0   0     0         0        0   0
-2.0            0         1        0   0     1         1        1   1     0         0        0   0
-1.8            0         1        0   0     1         1        1   1     0         0        0   0
-1.6            1         0        0   0     1         1        1   1     0         0        0   0
-1.4            1         0        0   0     1         1        1   1     0         0        0   0
-1.2            1         0        1   0     1         1        1   1     0         0        0   1
-1.0            0         0        1   1     1         1        1   1     0         0        1   1
-0.8            0         0        1   1     1         1        1   1     0         0        1   1
-0.6            0         0        1   1     1         1        1   1     0         0        1   1
-0.4            0         0        0   0     1         1        1   1     0         0        1   0
-0.2            0         1        0   1     1         1        1   1     0         0        0   0
0.0             1         1        1   1     1         1        1   1     0         0        0   0
0.2             1         1        1   1     1         1        1   1     0         0        1   0
0.4             1         1        1   1     1         1        1   1     0         0        0   0
0.6             1         1        1   1     1         1        0   1     0         0        0   0
0.8             1         1        1   1     0         1        0   0     0         0        0   0
1.0             0         0        0   0     0         0        0   0     0         0        0   0
1.2             0         0        0   0     0         0        0   0     0         0        0   0
1.4             0         0        0   0     0         0        0   0     0         0        0   0
1.6             0         0        0   0     0         0        0   0     0         0        0   0
1.8             1         0        0   0     0         0        0   0     0         0        0   0
2.0             1         0        0   0     0         0        0   0     0         0        0   0
2.2             0         0        0   0     0         0        0   0     0         0        0   0
2.4             0         0        0   0     0         0        0   0     1         0        0   0
2.6             0         0        0   0     0         0        0   0     1         0        0   0
2.8             0         0        0   0     0         0        0   0     0         0        0   0
3.0             0         0        0   0     0         0        0   0     0         0        0   0
3.2             0         0        0   0     0         0        0   1     0         0        0   0
3.4             0         0        0   0     0         0        1   1     0         0        0   0
3.6             0         0        0   0     0         0        1   1     0         0        0   0
3.8             0         0        0   0     0         0        1   1     0         0        0   0
4.0             0         0        0   0     0         0        1   1     0         0        0   0
Total Flag      10        8        9   9     19        18       17  19    2         8        5   4
Flag at >= 2.0  1         0        0   0     0         0        4   5     2         0        0   0

Conclusion

For test fairness, equating practices should ensure that test scores from different administrations
are equivalent. The current study provides a useful tool for the local evaluation of the equity
of test scores. The current study shows the limitations of the discrepancy indices that are
often used to assess the adequacy of an equating method, and suggests that bootstrap
intervals can serve as an alternative method to assess the adequacy of the various
equating methods. The approach is particularly helpful for high-stakes assessments where the
determination of cut scores takes on great importance and has implications for practice.

Although the bootstrap test proved useful for assessing equity properties,
it is also important to note that the results are limited to the specific sample data from
the pre-2015 PSAT/NMSQT test administration, in which the ability difference between new
and old forms was relatively large due to the difference in the target populations. In addition,
the equating design for the current study was the common item nonequivalent groups design. Thus,
future simulation studies should examine different equating designs and different levels of examinee
ability differences (between old and new forms).
Sensitivity analyses can also be conducted to assess the efficacy of the bootstrap test.